Split out the map/rank/bind into a separate man #557
jjhursey wants to merge 4 commits into openpmix:master
I'm working on a separate man page dedicated to the map/rank/bind functionality. This PR is definitely not ready, but I wanted to post it for the community to mark progress towards this goal for the v2 release. Anyone wanting to help feel free to reach out. |
Items that I removed from the The discussion about when to use I noticed that
But this does not work This |
|
I noticed in this example (testing --np 6): Before (from mpirun) the --map-by node line was as follows (round-robin): But now it is (sequential): I was running with: Is this a change in behavior or a typo in the original? |
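To make the two placements in the comment above concrete, here is a minimal sketch of a round-robin and a sequential placement for 6 processes. It is an illustration only, not PRRTE's mapper: the node names, the node count, and the `slots_per_node` value are assumptions chosen for the example, and both function names are hypothetical.

```python
def map_round_robin(num_procs, nodes):
    """Cycle across nodes: rank i lands on node i mod len(nodes)."""
    return {rank: nodes[rank % len(nodes)] for rank in range(num_procs)}


def map_sequential(num_procs, nodes, slots_per_node):
    """Fill every slot on a node before moving on to the next node."""
    placement = {}
    for rank in range(num_procs):
        node_index = rank // slots_per_node
        if node_index >= len(nodes):
            raise ValueError("not enough slots for the requested processes")
        placement[rank] = nodes[node_index]
    return placement


# Hypothetical allocation: 3 nodes with 2 slots each (assumed values).
nodes = ["node0", "node1", "node2"]
round_robin = map_round_robin(6, nodes)
sequential = map_sequential(6, nodes, slots_per_node=2)

for rank in range(6):
    print(f"rank {rank}: round-robin={round_robin[rank]} sequential={sequential[rank]}")
```

With round-robin placement, consecutive ranks land on different nodes; with sequential placement, consecutive ranks share a node until its slots are used up. Comparing the two outputs side by side is a quick way to tell which behavior a given report shows.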
|
If the nodes are oversubscribed the binding report is empty. Should we print out something like "MCW rank 0 is unbound" or "MCW rank 0 bound to nothing"? |
|
The It says the following, but I'm seeing that "max_slots" is being ignored and the extra processes are on Limits to oversubscription can also be specified in the hostfile itself: The
: causes the first 12 processes to be launched as before, but the |
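As a sketch of the behavior the documentation seems to promise, the following checks whether a requested process count fits under the hard caps given in a hostfile. This is not PRRTE code. It assumes hostfile entries of the form `host slots=N max_slots=M`, that a missing `slots` defaults to 1, and that a host without `max_slots` has no hard cap; the host names and function names are hypothetical.

```python
def parse_hostfile_line(line):
    """Parse 'host slots=N max_slots=M' (assumed syntax) into a dict."""
    fields = line.split()
    entry = {"host": fields[0], "slots": 1, "max_slots": None}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key in ("slots", "max_slots"):
            entry[key] = int(value)
    return entry


def fits_under_caps(num_procs, entries):
    """Return True if num_procs does not exceed the summed max_slots caps.

    A host with no max_slots is treated as uncapped (an assumption).
    """
    caps = [entry["max_slots"] for entry in entries]
    if any(cap is None for cap in caps):
        return True
    return num_procs <= sum(caps)


# Hypothetical hostfile contents.
hostfile = [
    "nodeA slots=4 max_slots=4",
    "nodeB slots=4 max_slots=8",
]
entries = [parse_hostfile_line(line) for line in hostfile]

print(fits_under_caps(12, entries))
print(fits_under_caps(13, entries))
```

If `max_slots` were honored as a hard limit, a request that fails this check would be expected to produce an error rather than place the extra processes on some node.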
|
The sequential mapper threw an unexpected error without the |
|
Left to do:
I'm off next week (back June 1) - the community can feel free to push updates to my branch if they want to help with the pages. Otherwise I'll keep working on this when I get back. |
|
I think we are going to hit a lot of confusion if we aren't careful here. First, you cannot execute the cmds as you are showing them here: $ prte --hostfile hostfile.txt --prtemca rmaps seq /bin/true The So I have to assume that the errors you are reporting are from you actually running those cmds using something like |
|
I fixed the sequential mapper, and I also added a new |
|
Actually I was running For the PRRTE man pages I want them to reflect the PRRTE behavior without any personality. Then we can have separate sections for various personalities or something. |
|
I tested a number of variations of the --map-by, --bind-to and --rank-by options with prterun and found the following problems where it appears the documentation (prte-map.1.md) should be updated. The cluster I tested this with had 4 Power 8 nodes with 2 packages (sockets) per node, each with 10 cores and each core with 8 hwthreads (160 hwthreads per node). The launch/local node was a Power 9 node. The naming convention for hostfiles is hostfile where is the number of slots specified with the slots= keyword, and where the hostfile lists the 4 nodes in the cluster.
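The resource counts for the test cluster described above can be checked with a short calculation. The figures (4 nodes, 2 packages per node, 10 cores per package, 8 hwthreads per core) come from the comment; the `NodeTopology` class is a hypothetical helper for illustration and has nothing to do with how PRRTE or hwloc represent topology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTopology:
    packages: int
    cores_per_package: int
    hwthreads_per_core: int

    @property
    def cores(self):
        # Total cores on one node.
        return self.packages * self.cores_per_package

    @property
    def hwthreads(self):
        # Total hardware threads on one node.
        return self.cores * self.hwthreads_per_core


# Compute nodes as described in the comment.
power8 = NodeTopology(packages=2, cores_per_package=10, hwthreads_per_core=8)
compute_nodes = 4

print("cores per node:", power8.cores)
print("hwthreads per node:", power8.hwthreads)
print("hwthreads in cluster:", compute_nodes * power8.hwthreads)
```

These totals are the upper bounds a mapping can reach without oversubscription, depending on whether cores or hwthreads are treated as the unit for a slot.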
In all the above cases, if you want PRTE to default to the number I added the --use-hwthread-cpus option and got an error message stating that was an unknown option.
The help text told me that numa was one of the allowed choices. When I replaced slot with numa I got a different error message telling me that numa was invalid.
The same error occurs for **prun --host c712f6n01,c712f6n02 --np 8 ./a.out** where the text says the additional 6 tasks should be allocated to these two nodes as well. This does not work, as expected with prterun unless an oversubscribe option is specified.
|
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* Remove
  - `pmixam` since it wasn't being processed
* Add
  - `gmca`
  - `gprtemca`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* Still lots to cleanup and verify here. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
|
I'm not sure if this is a documentation problem or a code problem, but if I run the command prterun -n 24 --hostfile8 --bind-to slot taskinfo I get a message that the binding policy slot is not recognized. |
Correct - "slot" has no physical meaning, so we cannot bind you to it. |
|
Per #696 clarify that |
I don't think that sentence makes sense, nor do I think that is what is happening. The |
|
Replaced by PR #773 |